Abstract

It is now known that an extended Gaussian process model equipped with rescal- 
ing can adapt to different smoothness levels of a function valued parameter in many 
nonparametric Bayesian analyses, offering a posterior convergence rate that is optimal 
(up to logarithmic factors) for the smoothness class the true function belongs to. This 
optimal rate also depends on the dimension of the function's domain and one could 
potentially obtain a faster rate of convergence by casting the analysis in a lower di- 
mensional subspace that does not amount to any loss of information about the true 
function. In general such a subspace is not known a priori but can be explored by 
equipping the model with variable selection or linear projection. We demonstrate that 
for nonparametric regression, classification, density estimation and density regression, 
a rescaled Gaussian process model equipped with variable selection or linear projection 
offers a posterior convergence rate that is optimal (up to logarithmic factors) for the 
lowest dimension in which the analysis could be cast without any loss of information 
about the true function. Theoretical exploration of such dimension reduction features 
appears novel for Bayesian nonparametric models with or without Gaussian processes. 

Keywords. Bayesian nonparametric models, Posterior convergence rates, Gaussian processes,
Dimension reduction, Nonparametric regression and classification, Density estimation
and regression.

1 Introduction 



Gaussian processes are widely used in Bayesian analyses for specifying prior distributions over 
function valued parameters. Examples include spatio-temporal modeling (Handcock and Stein, 
1993; Kim et al., 2005; Banerjee et al., 2008), computer emulation (Sacks et al., 1989; Kennedy and O'Hagan,
2001; Oakley and O'Hagan, 2002; Gramacy and Lee, 2008), nonparametric regression and
classification (Neal, 1998; Csato et al., 2000; Rasmussen and Williams, 2006; Short et al., 
2007), density estimation (Lenk, 1988; Tokdar, 2007), density and quantile regression (Tokdar et al., 
2010; Tokdar and Kadane, 2011), functional data analysis (Shi and Wang, 2008; Petrone et al., 
2009) and image analysis (Sudderth and Jordan, 2009). Rasmussen and Williams (2006) 
give a thorough overview of likelihood based exploration of Gaussian process models, including
Bayesian treatments.

Theoretical properties of many Bayesian Gaussian process models have been well researched
(see Tokdar and Ghosh, 2007; Choi and Schervish, 2007; Ghosal and Roy, 2006;
van der Vaart and van Zanten, 2008, 2009; de Jonge and van Zanten, 2010; Castillo, 2011,
and the references therein). In particular, van der Vaart and van Zanten (2009) present a
remarkable adaptation property of such models for nonparametric regression, classification
and density estimation. They show that a common Gaussian process (GP) prior specification
equipped with a suitable rescaling parameter offers posterior convergence at near optimal
minimax asymptotic rates across many classes of finitely and infinitely differentiable true
functions. The rescaling parameter is a stochastic counterpart of a global bandwidth parameter
commonly seen in smoothing-based non-Bayesian methodology. However, a single prior
distribution on the rescaling parameter is enough to ensure near optimal convergence across
all these classes of functions.

In this article we explore additional adaptation properties of GP models that are also
equipped with variable selection or linear projection. To appreciate the practical utility of
this exercise, consider a nonparametric (mean) regression model $Y_i = f(X_i) + \xi_i$, $i \ge 1$, where
the $X_i$'s are $d$-dimensional and the $\xi_i$'s are independent draws from a zero mean normal distribution.
When $f$ is assigned a suitable GP prior distribution equipped with a rescaling parameter
and the true conditional mean function $f_0$ is Hölder $\alpha$-smooth (Section 2.1), the posterior
distribution on $f$ converges to $f_0$ at a rate $n^{-\alpha/(2\alpha+d)}(\log n)^{k}$. This rate, without the $\log n$
term, is optimal for such an $f_0$ in a minimax asymptotic sense (Stone, 1982). Now suppose
$f_0(X_i)$ depends on $X_i$ only through its first two coordinates $Z_i$. If this information was
known, we could cast the model as $Y_i = g(Z_i) + \xi_i$ and assign $g$ a GP prior distribution
with rescaling to obtain a faster convergence rate of $n^{-\alpha/(2\alpha+2)}(\log n)^{k_2}$. If in addition we
knew that $f_0(X_i)$ depends only on the difference $U_i$ of the first two coordinates of $X_i$, then
we would instead cast the model as $Y_i = h(U_i) + \xi_i$ and with a rescaled GP prior on $h$ obtain
an even faster convergence rate of $n^{-\alpha/(2\alpha+1)}(\log n)^{k_1}$.
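
To make the gain from dimension reduction concrete, the following sketch compares the polynomial part $n^{-\alpha/(2\alpha+d)}$ of the three rates above, ignoring the logarithmic factors. The function names, the smoothness level $\alpha = 2$, the ambient dimension $d = 10$ and the target accuracy 0.1 are all illustrative assumptions and do not come from the analysis itself.

```python
def rate_exponent(alpha, dim):
    """Exponent r of the polynomial rate n**(-r), with r = alpha / (2*alpha + dim)."""
    return alpha / (2.0 * alpha + dim)


def sample_size_for(target, alpha, dim):
    """Real-valued n solving n**(-r) = target, ignoring logarithmic factors."""
    return target ** (-1.0 / rate_exponent(alpha, dim))


if __name__ == "__main__":
    alpha = 2.0  # assumed smoothness level, for illustration only
    # dim = 10: full covariate vector; 2: two coordinates; 1: a single projection
    for dim in (10, 2, 1):
        r = rate_exponent(alpha, dim)
        n = sample_size_for(0.1, alpha, dim)
        print(f"dim={dim:2d}  exponent={r:.3f}  n for accuracy 0.1 ~ {n:,.0f}")
```

Under these assumed values the sample size needed for the polynomial term to reach 0.1 falls from about $10^7$ in ten dimensions to about $10^3$ in two, which is the practical content of the faster rates.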

In practice, we do not know what sort of lower dimensional projections of $X_i$ perfectly
explain the dependence of $f_0(X_i)$ on $X_i$. But this could be explored by extending the GP
model to include selection of variables (Linkletter et al., 2006) or linear projection onto lower
dimensional subspaces (Tokdar et al., 2010). The questions we seek to answer are as follows.
Do GP models equipped with rescaling and variable selection offer a posterior convergence
rate of $n^{-\alpha/(2\alpha+d_1)}(\log n)^{k}$ when the true $f$ is a Hölder $\alpha$-smooth function $f_0$ that depends
only on $d_1 < d$ many coordinates of its argument? More generally, do GP models equipped
with rescaling and linear projection offer a posterior convergence rate of $n^{-\alpha/(2\alpha+d_0)}(\log n)^{k}$
when the true $f$ is a Hölder $\alpha$-smooth function $f_0$ such that $f_0(X_i)$ depends on a rank-$d_0$
linear projection of $X_i$? We demonstrate the answer to either question to be affirmative for
extensions of the so called square exponential GP models in nonparametric mean regression,
classification, density estimation and density regression.

Although projection or selection based dimension reduction is routinely employed in a va- 
riety of Bayesian nonparametric models with or without the use of Gaussian processes (see for 
example, Rodriguez and Dunson, 2011), their theoretical implications have not been fully ex- 
plored. Best results so far demonstrate posterior consistency (Tokdar et al., 2010; Pati et al., 
2011), which already holds without these advanced features. Our results indicate that there 
is indeed an added advantage in terms of possibly faster posterior convergence rates. These 
results, with necessary details, are presented in Section 2, which, we hope, can be appreciated
by all readers interested in Bayesian nonparametric models with or without technical
knowledge about Gaussian processes. Section 3 presents a set of deeper and more fundamen- 
tal results, with non-trivial extensions of results presented in van der Vaart and van Zanten 
(2009). However, we have tried our best to make our calculations easily accessible to other 
researchers interested in studying extensions of GP models with additional adaptation fea- 
tures. We conclude in Section 4 with remarks on density regression versus density estimation 
and on a recent, unpublished work on a similar topic by Bhattacharya et al. (2011). 

2 Main results 

2.1 Extending a rescaled GP with variable selection or projection 

We will restrict ourselves to nonparametric models where a function valued parameter $f$,
to be modeled by a GP or its extensions, is defined over a compact subset of $\mathbb{R}^d$ for some
$d$. Without loss of generality we can assume this set to be equal to $U_d$, the unit disc
$\{x \in \mathbb{R}^d : \|x\| \le 1\}$ centered at the origin. If the actual domain of $f$ is not elliptic, such as
a rectangle giving bounds on each coordinate of the argument $x$, we will simply shift and
scale it to fit inside $U_d$. Working on the larger domain poses no technical difficulties.

Let $W = (W(t) : t \in \mathbb{R}^d)$ be a separable, zero mean Gaussian process with an isotropic,
square exponential covariance function $E\{W(t)W(s)\} = \exp(-\|t - s\|^2)$. For any $a > 0$,
$b \in \{0,1\}^d$ and $q \in O_d$, the set of $d \times d$ orthogonal matrices, define
$W^{a,b,q} = (W^{a,b,q}(x) : x \in U_d)$ by

$$W^{a,b,q}(x) = W(\mathrm{diag}(ab) \cdot qx), \tag{1}$$

where for any vector $v$, $\mathrm{diag}(v)$ denotes the diagonal matrix with the elements of $v$ on its
diagonal. Note that $W^{a,b,q}(x) = W^{a,b,q}(z)$ if and only if $Rx = Rz$ where $R$ is the orthogonal
projection matrix $q'\mathrm{diag}(b)q$. Therefore the law of $W^{a,b,q}$ defines a probability measure on
functions $f : U_d \to \mathbb{R}$ such that $f(x)$ depends on $x$ only through the projection $Rx$. Also
note that with $q = I_d$, the $d$-dimensional identity matrix, $R$ simply projects along the axes
selected by $b$.
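
The projection property of (1) can be checked numerically through the covariance of the transformed process, which equals $\exp(-a^2\|\mathrm{diag}(b)q(x - z)\|^2)$. The sketch below is an illustration only: the function names, the dimension $d = 3$ and the randomly generated orthogonal matrix are assumptions made for the example. Two inputs with the same projection have covariance one, so the process takes the same value at both almost surely.

```python
import numpy as np


def projected_se_kernel(x, z, a, b, q):
    """Covariance of W(diag(a*b) q x) and W(diag(a*b) q z),
    assuming E{W(t) W(s)} = exp(-||t - s||^2)."""
    b = np.asarray(b, dtype=float)
    m = np.diag(a * b) @ np.asarray(q, dtype=float)
    diff = m @ (np.asarray(x, dtype=float) - np.asarray(z, dtype=float))
    return float(np.exp(-diff @ diff))


def projection_matrix(b, q):
    """Orthogonal projection R = q' diag(b) q."""
    q = np.asarray(q, dtype=float)
    return q.T @ np.diag(np.asarray(b, dtype=float)) @ q


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 3
    # QR factorization of a Gaussian matrix yields an orthogonal q (illustrative).
    q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    b = np.array([1, 0, 0])
    r = projection_matrix(b, q)
    x = np.array([0.3, -0.2, 0.1])
    # Move x within the null space of R, so that R z = R x.
    z = x + (np.eye(d) - r) @ np.array([0.2, 0.1, -0.3])
    print(projected_se_kernel(x, z, a=2.0, b=b, q=q))  # equals 1 up to rounding
```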

Let $|b|$ denote the number of ones in a $b \in \{0,1\}^d$. Suppose $(A, B, Q)$ are distributed as

$$(B, Q) \sim \pi_B \times \pi_Q, \qquad A^{|b|} \mid (B = b, Q) \sim \mathrm{Ga}(a_1, a_2), \tag{2}$$

independently of $W$, where $a_1 > 1$, $a_2 > 0$, $\pi_B$ is a strictly positive probability mass function
on $\{0,1\}^d$ and $\pi_Q$ is a strictly positive probability density function on $O_d$. When $B = 0$,
we simply take $A$ to be degenerate at 1. The law of the process $W^{A,B,Q}$, which extends
the square exponential GP law by equipping it with rescaling and linear projection, will
be denoted $GP_{lp}(a_1, a_2, \pi_B, \pi_Q)$. Similarly, the law of the process $W^{A,B,I_d}$, which extends
the square exponential GP law by equipping it with rescaling and variable selection, will be
denoted $GP_{vs}(a_1, a_2, \pi_B)$.
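
A draw of $(A, B)$ from the variable selection version of (2) can be sketched as follows. Several choices here are assumptions made only for illustration: $\pi_B$ is taken to be a product of Bernoulli terms (strictly positive on $\{0,1\}^d$ when the inclusion probability lies strictly between 0 and 1), $\mathrm{Ga}(a_1, a_2)$ is read as shape $a_1$ and rate $a_2$, and the function names are hypothetical.

```python
import numpy as np


def draw_scale_given_b(b, a1, a2, rng):
    """Draw A given B = b so that A**|b| ~ Gamma(a1, a2); A = 1 when b = 0."""
    k = int(np.sum(b))
    if k == 0:
        return 1.0  # A is degenerate at 1 when no coordinate is selected
    # Assumption: a2 is a rate parameter, so the numpy scale is 1 / a2.
    g = rng.gamma(shape=a1, scale=1.0 / a2)
    return float(g ** (1.0 / k))


def draw_vs_prior(d, a1, a2, rng, incl_prob=0.5):
    """Draw (A, B) with B a vector of independent Bernoulli(incl_prob) indicators."""
    b = (rng.random(d) < incl_prob).astype(int)
    return draw_scale_given_b(b, a1, a2, rng), b


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    for _ in range(3):
        a, b = draw_vs_prior(d=5, a1=2.0, a2=1.0, rng=rng)
        print(b, round(a, 3))
```

Placing the gamma law on $A^{|b|}$ rather than on $A$ ties the prior on the rescaling parameter to the number of selected coordinates, as in (2).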

In the sequel, a function $f : U \to \mathbb{R}$ defined on a compact subset $U$ of $\mathbb{R}^d$ is called Hölder
$\alpha$-smooth for some $\alpha > 0$ if it has bounded continuous derivatives (in the interior of $U$) up
to order $\lfloor\alpha\rfloor$, the largest integer smaller than $\alpha$, with all its $\lfloor\alpha\rfloor$-th order partial derivatives
being Hölder continuous with exponent no larger than $\alpha - \lfloor\alpha\rfloor$.

2.2 Mean regression with Gaussian errors



Nonparametric regression of a response variable on a vector of covariates with Gaussian
errors comes in two flavors, depending on how the design points, i.e., the covariate values,
are obtained. They could either be fixed in an experimental study or measured as part of
an observational study. The notion of posterior convergence differs slightly across the two
contexts; a brief overview is given below.



Fixed design regression. Suppose real-valued observations $Y_1, Y_2, \ldots$ are modeled as
$Y_i = f(x_i) + \xi_i$ for a given sequence of points $x_1, x_2, \ldots$ from $U_d$, with independent $N(0, \sigma^2)$ errors
$\xi_1, \xi_2, \ldots$. Assume $(f, \sigma)$ is assigned a prior distribution $\Pi_{f,\sigma}(df, d\sigma) = \Pi_f(df) \times \pi_\sigma(d\sigma)$ where
$\Pi_f$ is either $GP_{vs}(a_1, a_2, \pi_B)$ or $GP_{lp}(a_1, a_2, \pi_B, \pi_Q)$ and $\pi_\sigma$ is a probability measure with a
compact support inside $(0, \infty)$ and has a Lebesgue density that is strictly positive on this
support.

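Let $\Pi^n_{f,\sigma}$ denote the posterior distribution of $(f, \sigma)$ given only the first $n$ observations
$Y_1, \ldots, Y_n$, i.e.,

$$\Pi^n_{f,\sigma}(df, d\sigma) = \frac{\sigma^{-n}\exp\{-\frac{1}{2\sigma^2}\sum_{i=1}^n (Y_i - f(x_i))^2\}\,\Pi_{f,\sigma}(df, d\sigma)}{\int \sigma^{-n}\exp\{-\frac{1}{2\sigma^2}\sum_{i=1}^n (Y_i - f(x_i))^2\}\,\Pi_{f,\sigma}(df, d\sigma)}.$$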

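For every $n \ge 1$, define a design-dependent metric $\|\cdot\|_n$ on $\mathbb{R}^{U_d}$ as
$\|f - g\|_n^2 = \frac{1}{n}\sum_{i=1}^n (f(x_i) - g(x_i))^2$. Let $(\epsilon_n : n \ge 1)$ be a sequence of positive numbers with
$\lim_{n\to\infty}\epsilon_n = 0$ and $\lim_{n\to\infty} n\epsilon_n^2 = \infty$. For any fixed $f_0 : U_d \to \mathbb{R}$ and $\sigma_0 > 0$ we say the
posterior converges at $(f_0, \sigma_0)$ at a rate $\epsilon_n$ (or faster) if for some $M > 0$,

$$\mathrm{plim}_{n\to\infty}\, \Pi^n_{f,\sigma}(\{(f, \sigma) : \|f - f_0\|_n + |\sigma - \sigma_0| > M\epsilon_n\}) = 0$$

whenever $Y_i = f_0(x_i) + \xi_i$ with independent $\xi_i \sim N(0, \sigma_0^2)$. Here "plim" indicates convergence
in probability.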



Random design regression. In the random design setting we have observations
$(X_i, Y_i) \in U_d \times \mathbb{R}$, $i = 1, 2, \ldots$ which are partially modeled as $Y_i = f(X_i) + \xi_i$ with $N(0, \sigma^2)$ errors
$\xi_1, \xi_2, \ldots$. The design points $X_1, X_2, \ldots$ are assumed to be independent observations from
an unknown probability distribution $G_x$ on $U_d$, and the $X_i$'s and $\xi_i$'s are assumed independent.
However, inference on $G_x$ is not deemed important. Assume $(f, \sigma)$ is assigned a prior
distribution $\Pi_{f,\sigma}$ as in the previous subsection and the corresponding posterior distribution based
on the first $n$ observations $(X_1, Y_1), \ldots, (X_n, Y_n)$ is denoted $\Pi^n_{f,\sigma}$.

Let $\|\cdot\|_{G_x}$ denote the $L_2$-metric with respect to $G_x$, i.e., $\|f - g\|_{G_x}^2 = \int (f(x) - g(x))^2\, G_x(dx)$.
Consider a sequence $(\epsilon_n > 0 : n \ge 1)$ as before. For any fixed $f_0 : U_d \to \mathbb{R}$ and $\sigma_0 > 0$ we
say the posterior converges at $(f_0, \sigma_0)$ at a rate $\epsilon_n$ (or faster) if for some $M > 0$,

$$\mathrm{plim}_{n\to\infty}\, \Pi^n_{f,\sigma}(\{(f, \sigma) : \|f - f_0\|_{G_x} + |\sigma - \sigma_0| > M\epsilon_n\}) = 0$$

whenever $Y_i = f_0(X_i) + \xi_i$ with $(\xi_i, X_i) \sim N(0, \sigma_0^2) \times G_x$ independently across $i \ge 1$.

Note that in either setting convergence at $(f_0, \sigma_0)$ at a rate $\epsilon_n$ also implies convergence
of the (marginal) posterior distribution on $f$ to $f_0$ at the same rate (or faster). For either
setting we can state the following dimension adaptation result.

Theorem 1. Assume $f_0 : U_d \to \mathbb{R}$ is a Hölder $\alpha$-smooth function on $U_d$ and $\sigma_0 > 0$ is inside
the support of $\pi_\sigma$. If $f_0(x)$ depends on $x$ only through $d_1 < d$ many coordinates of $x$ and
$\Pi_f = GP_{vs}(a_1, a_2, \pi_B)$, the posterior converges at $(f_0, \sigma_0)$ at a rate $\epsilon_n = n^{-\alpha/(2\alpha+d_1)}(\log n)^k$
for every $k > d + 1$. Furthermore, if $\alpha > 1$, $f_0(x)$ depends on $x$ only through a rank-$d_0$ linear
projection $Rx$ and $\Pi_f = GP_{lp}(a_1, a_2, \pi_B, \pi_Q)$, the posterior converges at $(f_0, \sigma_0)$ at a rate
$\epsilon_n = n^{-\alpha/(2\alpha+d_0)}(\log n)^k$ for every $k > d + 1$.

2.3 Classification 

Suppose observations $(X_i, Y_i) \in U_d \times \{0, 1\}$, $i = 1, 2, \ldots$ are (partially) modeled as
$Y_i \sim \mathrm{Ber}(\Phi(f(X_i)))$, independently across $i$, where $\Phi$ is the standard normal or the logistic
cumulative distribution function, with the $X_i$'s assumed to be independent draws from a
probability distribution $G_x$ on $U_d$. Assume $f$ is assigned a prior distribution $\Pi_f$ which is either
$GP_{vs}(a_1, a_2, \pi_B)$ or $GP_{lp}(a_1, a_2, \pi_B, \pi_Q)$, and let $\Pi^n_f$ denote the corresponding posterior
distribution based on the first $n$ observations $(X_1, Y_1), \ldots, (X_n, Y_n)$, i.e.,

$$\Pi^n_f(df) = \frac{\prod_{i=1}^n \Phi(f(X_i))^{Y_i}\{1 - \Phi(f(X_i))\}^{1-Y_i}\,\Pi_f(df)}{\int \prod_{i=1}^n \Phi(f(X_i))^{Y_i}\{1 - \Phi(f(X_i))\}^{1-Y_i}\,\Pi_f(df)}.$$

Consider a sequence $(\epsilon_n > 0 : n \ge 1)$ as before. For any fixed $f_0 : U_d \to \mathbb{R}$ we
say the posterior converges at $f_0$ at a rate $\epsilon_n$ (or faster) if for some $M > 0$,

$$\mathrm{plim}_{n\to\infty}\, \Pi^n_f(\{f : \|f - f_0\|_{G_x} > M\epsilon_n\}) = 0$$

whenever $Y_i \mid X_i \sim \mathrm{Ber}(\Phi(f_0(X_i)))$ and $X_i \sim G_x$, independently across $i \ge 1$.

Theorem 2. Let $f_0 : U_d \to \mathbb{R}$ be a Hölder $\alpha$-smooth function on $U_d$. If $f_0(x)$ depends
on $x$ only through $d_1 < d$ many coordinates of $x$ and $\Pi_f = GP_{vs}(a_1, a_2, \pi_B)$, the posterior
converges at $f_0$ at a rate $\epsilon_n = n^{-\alpha/(2\alpha+d_1)}(\log n)^k$ for every $k > d + 1$. Furthermore,
if $\alpha > 1$, $f_0(x)$ depends on $x$ only through a rank-$d_0$ linear projection $Rx$ and
$\Pi_f = GP_{lp}(a_1, a_2, \pi_B, \pi_Q)$, the posterior converges at $f_0$ at a rate $\epsilon_n = n^{-\alpha/(2\alpha+d_0)}(\log n)^k$
for every $k > d + 1$.

2.4 Density or point pattern intensity estimation 

Consider observations $X_i \in U_d$, $i = 1, 2, \ldots$ modeled as independent draws from a probability
density $g$ on $U_d$ that can be written as

$$g(x) = \frac{g^*(x)\exp(f(x))}{\int_{U_d} g^*(z)\exp(f(z))\,dz}$$

for some fixed, non-negative function $g^*$ and some unknown $f : U_d \to \mathbb{R}$. This type of model
also arises in analyzing spatial point pattern data with non-homogeneous Poisson process
models where the intensity function is expressed as $g^*(x)\exp(f(x))$. Assume $f$ is assigned
the prior distribution $\Pi_f$ which is either $GP_{vs}(a_1, a_2, \pi_B)$ or $GP_{lp}(a_1, a_2, \pi_B, \pi_Q)$, and let $\Pi_g$
denote the induced prior distribution on $g$. The corresponding posterior distribution based
on $X_1, \ldots, X_n$ is given by

$$\Pi^n_g(dg) = \frac{\prod_{i=1}^n g(X_i)\,\Pi_g(dg)}{\int \prod_{i=1}^n g(X_i)\,\Pi_g(dg)}.$$

Let $h(g_1, g_2) = \{\int_{U_d} (g_1^{1/2}(x) - g_2^{1/2}(x))^2\,dx\}^{1/2}$ denote the Hellinger metric.
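
The map from $f$ to the density $g$ and the Hellinger metric can both be approximated on a grid. The sketch below works in one dimension, where $U_1 = [-1, 1]$, and uses Riemann sums on grid midpoints; the function names, the grid size, the choice $g^* = 1$ and the test function $\sin(3x)$ are illustrative assumptions.

```python
import numpy as np


def tilted_density(f_vals, base_vals, dx):
    """Normalize g*(x) exp(f(x)) to a density on an equally spaced grid."""
    # Subtracting max(f) avoids overflow and leaves the normalized density unchanged.
    w = base_vals * np.exp(f_vals - np.max(f_vals))
    return w / (np.sum(w) * dx)


def hellinger(g1, g2, dx):
    """Riemann-sum approximation of {int (sqrt(g1) - sqrt(g2))^2 dx}^(1/2)."""
    return float(np.sqrt(np.sum((np.sqrt(g1) - np.sqrt(g2)) ** 2) * dx))


if __name__ == "__main__":
    n = 2000
    dx = 2.0 / n
    x = -1.0 + dx * (np.arange(n) + 0.5)  # midpoints of a grid on [-1, 1]
    base = np.ones(n)                      # g* = 1, an illustrative choice
    g_a = tilted_density(np.sin(3 * x), base, dx)
    g_b = tilted_density(np.zeros(n), base, dx)  # f = 0 gives the uniform density
    print(hellinger(g_a, g_b, dx))
```

Because $g$ is unchanged when a constant is added to $f$, only the shape of $f$ matters for the induced density.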

Consider a sequence $(\epsilon_n > 0 : n \ge 1)$ as before. For any fixed density $g_0$ on $U_d$, we say
the posterior converges at $g_0$ at a rate $\epsilon_n$ (or faster) if for some $M > 0$,

$$\mathrm{plim}_{n\to\infty}\, \Pi^n_g(\{g : h(g, g_0) > M\epsilon_n\}) = 0$$

whenever the $X_i$'s are independent draws from $g_0$.

Theorem 3. Let $g_0$ be a probability density on $U_d$ satisfying $g_0(x) = g^*(x)e^{f_0(x)} / \int g^*(z)e^{f_0(z)}\,dz$
for some Hölder $\alpha$-smooth $f_0 : U_d \to \mathbb{R}$. If $f_0(x)$ depends on $x$ only through $d_1 < d$
many coordinates of $x$ and $\Pi_f = GP_{vs}(a_1, a_2, \pi_B)$, the posterior converges at $g_0$ at a rate
$\epsilon_n = n^{-\alpha/(2\alpha+d_1)}(\log n)^k$ for every $k > d + 1$. Furthermore, if $\alpha > 1$, $f_0(x)$ depends on $x$
only through a rank-$d_0$ linear projection $Rx$ and $\Pi_f = GP_{lp}(a_1, a_2, \pi_B, \pi_Q)$, the posterior
converges at $g_0$ at a rate $\epsilon_n = n^{-\alpha/(2\alpha+d_0)}(\log n)^k$ for every $k > d + 1$.

In the above theorem, the conditions on $f_0$ are equivalent to saying that $g_0(x)/g^*(x)$
varies in $x$ only along a linear subspace of the variable $x$. In the context of two dimensional
point pattern models, this implies that the intensity function, relative to $g^*$, is constant over
the spatial domain or constant along a certain direction.



2.5 Density regression 

Consider again observations $(X_i, Y_i) \in U_d \times \mathbb{R}$, $i = 1, 2, \ldots$ where we want to develop a
regression model between the $Y_i$'s and $X_i$'s. In density regression the entire conditional density,
and not just the conditional mean of $Y_i$ given $X_i$, is modeled nonparametrically. Tokdar et al.
(2010) consider the model $Y_i \mid X_i \sim g(\cdot \mid X_i)$, independently across $i$, where the conditional
densities $g(\cdot \mid x)$, $x \in U_d$, are given by point by point logistic transforms of a function
$f : U_d \times [0, 1] \to \mathbb{R}$:

$$g(y \mid x) = \frac{g^*(y)\exp\{f(x, G^*(y))\}}{\int_{-\infty}^{\infty} g^*(z)\exp\{f(x, G^*(z))\}\,dz} \tag{3}$$

for some fixed probability density $g^*$ on $\mathbb{R}$ with cumulative distribution function $G^*$. To
construct a suitable prior distribution for $f$, we consider an extension of the process $W^{a,b,q}$.

Let $Z = (Z(t, u) : t \in \mathbb{R}^d, u \in [0, \infty))$ be a separable, zero mean Gaussian process with
isotropic, square-exponential covariance function $E\{Z(t, u)Z(s, v)\} = \exp(-\|t - s\|^2 - |u - v|^2)$.
Define $Z^{a,b,q} = (Z^{a,b,q}(x, u) : x \in U_d, u \in [0, 1])$ as

$$Z^{a,b,q}(x, u) = Z(\mathrm{diag}(ab) \cdot qx, au). \tag{4}$$

Let $GP_{vs*}(a_1, a_2, \pi_B)$ and $GP_{lp*}(a_1, a_2, \pi_B, \pi_Q)$, respectively, denote the laws of the processes
$Z^{A,B,I_d}$ and $Z^{A,B,Q}$ where $(B, Q)$ are distributed as in (2) and $A^{|b|+1} \mid (B = b, Q) \sim \mathrm{Ga}(a_1, a_2)$.

Now suppose $f$ in (3) is assigned the prior distribution $\Pi_f$ which is either $GP_{vs*}(a_1, a_2, \pi_B)$
or $GP_{lp*}(a_1, a_2, \pi_B, \pi_Q)$, and denote the induced prior distribution on
$g = (g(y \mid x) : x \in U_d, y \in \mathbb{R})$ by $\Pi_g$. The corresponding posterior distribution based on
$(X_1, Y_1), \ldots, (X_n, Y_n)$ is given by

$$\Pi^n_g(dg) = \frac{\prod_{i=1}^n g(Y_i \mid X_i)\,\Pi_g(dg)}{\int \prod_{i=1}^n g(Y_i \mid X_i)\,\Pi_g(dg)}.$$

Let $\rho_{G_x}(\cdot, \cdot)$ denote the metric $\rho^2_{G_x}(g_1, g_2) = \int\!\!\int \{g_1^{1/2}(y \mid x) - g_2^{1/2}(y \mid x)\}^2\,dy\,G_x(dx)$ for a probability
distribution $G_x$ on $U_d$.

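Consider a sequence $(\epsilon_n > 0 : n \ge 1)$ as before. For any fixed $g_0 = (g_0(y \mid x) : x \in U_d, y \in \mathbb{R})$,
we say the posterior converges at $g_0$ at a rate $\epsilon_n$ (or faster) if for some $M > 0$,

$$\mathrm{plim}_{n\to\infty}\, \Pi^n_g(\{g : \rho_{G_x}(g, g_0) > M\epsilon_n\}) = 0$$

whenever $Y_i \mid X_i \sim g_0(\cdot \mid X_i)$ and $X_i \sim G_x$, independently across $i \ge 1$.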

Theorem 4. Let $g_0 = (g_0(y \mid x) : x \in U_d, y \in \mathbb{R})$ satisfy

$$g_0(y \mid x) = \frac{g^*(y)\exp\{f_0(x, G^*(y))\}}{\int g^*(z)\exp\{f_0(x, G^*(z))\}\,dz}$$

for an $f_0 : U_d \times [0, 1] \to \mathbb{R}$ that is Hölder $\alpha$-smooth. If $f_0(x, u)$ depends on $x$ only through
$d_1 < d$ many coordinates of $x$ and $\Pi_f = GP_{vs*}(a_1, a_2, \pi_B)$, the posterior converges at $g_0$
at a rate $\epsilon_n = n^{-\alpha/(2\alpha+d_1+1)}(\log n)^k$ for every $k > d + 2$. Furthermore, if $\alpha > 1$, $f_0(x, u)$
depends on $x$ only through a rank-$d_0$ linear projection $Rx$ and $\Pi_f = GP_{lp*}(a_1, a_2, \pi_B, \pi_Q)$,
the posterior converges at $g_0$ at a rate $\epsilon_n = n^{-\alpha/(2\alpha+d_0+1)}(\log n)^k$ for every $k > d + 2$.



3 Adaptation properties of GP extensions 

Ghosal et al. (2000), later refined by Ghosal and van der Vaart (2007), provide a set of three 
sufficient conditions that can be used to establish posterior convergence rates for Bayesian 
non-parametric models for independent observations. One of these conditions relates to prior 
concentration at the true function, and the other two relate to existence of a sequence of 
compact sets which have relatively small sizes but receive large probabilities from the prior 
distribution. 

For the results stated in Theorems 1, 2 and 3 relating respectively to mean regression,
classification or density estimation, these three sufficient conditions map one to one
(van der Vaart and van Zanten, 2008) to the following conditions on an extended GP $W$
with law $GP_{vs}(a_1, a_2, \pi_B)$ or $GP_{lp}(a_1, a_2, \pi_B, \pi_Q)$ as appropriate, the true function $f_0$ and
the desired rate $\epsilon_n$: there exist sets $B_n \subset \mathbb{R}^{U_d}$ and a sequence $(\bar\epsilon_n > 0 : n \ge 1)$ with $\bar\epsilon_n \le \epsilon_n$,
$\lim_{n\to\infty} n\bar\epsilon_n^2 = \infty$ such that for all sufficiently large $n$,

$$P(\|W - f_0\|_\infty \le \bar\epsilon_n) \ge e^{-n\bar\epsilon_n^2}, \tag{5}$$
$$P(W \notin B_n) \le e^{-4n\bar\epsilon_n^2}, \tag{6}$$
$$\log N(\epsilon_n, B_n, \|\cdot\|_\infty) \le n\epsilon_n^2, \tag{7}$$

where $N(\epsilon, B, \rho)$ denotes the minimum number of balls of radius $\epsilon$ (with respect to a metric
$\rho$) needed to cover a set $B$. For the density regression results stated in Theorem 4, the
sufficient conditions also map one to one to the above but with $W$ now following either
$GP_{vs*}(a_1, a_2, \pi_B)$ or $GP_{lp*}(a_1, a_2, \pi_B, \pi_Q)$ and with $B_n \subset \mathbb{R}^{U_d \times [0,1]}$. This can be proved along
the lines of Theorem 3.2 of van der Vaart and van Zanten (2008), by looking at the joint
density of $(X_i, Y_i)$ determined by $G_x$ and $g \sim \Pi_g$.

We verify these conditions in the following subsections by extending the calculations
presented in van der Vaart and van Zanten (2009). A fundamental ingredient of these
calculations is the reproducing kernel Hilbert space (RKHS) associated with a Gaussian process.
The RKHS of $W$ is defined as the set of functions $h : \mathbb{R}^d \to \mathbb{R}$ that can be represented as
$h(t) = E\{W(t)L\}$ for some $L$ in the closure of $\{V = a_1W(t_1) + \cdots + a_kW(t_k) : k \ge 1, a_i \in \mathbb{R}, t_i \in \mathbb{R}^d\}$.
Similar definitions apply to the processes $W^{a,b,q}$ and $Z^{a,b,q}$ with domains $U_d$ and
$U_d \times [0, 1]$ instead of $\mathbb{R}^d$.

In Lemma 4.1 of van der Vaart and van Zanten (2009), the RKHS of $W$ is identified as
the set of functions $h$ such that $h(t) = \mathrm{Re}\{\int e^{i\langle\lambda, t\rangle}\psi(\lambda)\,d\mu(\lambda)\}$ for some $\psi \in L_2(\mu)$, where
$\mathrm{Re}\{z\}$ denotes the real part of a complex number $z$, $i$ is the square root of $-1$ and $\mu$ is the
(unique) spectral measure on $\mathbb{R}^d$ of $W$, satisfying $E\{W(t)W(s)\} = \int e^{-i\langle\lambda, t - s\rangle}\,d\mu(\lambda)$. For
the isotropic, square exponential GP $W$, the spectral measure is the $d$-dimensional Gaussian
probability measure with mean zero and variance matrix $2I_d$. The RKHS norm of such an
$h$ is precisely $\|\psi\|_{L_2(\mu)}$.
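
The identification of the spectral measure can be checked by simulation: averaging $\cos\langle\lambda, t - s\rangle$ over draws $\lambda \sim N(0, 2I_d)$ should reproduce $\exp(-\|t - s\|^2)$, since the imaginary part of the integral vanishes by symmetry. The function names, the two test points and the number of draws below are illustrative assumptions.

```python
import numpy as np


def se_covariance(t, s):
    """Square exponential covariance exp(-||t - s||^2)."""
    diff = np.asarray(t, dtype=float) - np.asarray(s, dtype=float)
    return float(np.exp(-diff @ diff))


def spectral_covariance_mc(t, s, n_draws, rng):
    """Monte Carlo estimate of int exp(-i<lam, t - s>) d mu(lam), mu = N(0, 2 I_d).

    Only the cosine (real) part is averaged; the sine part has mean zero.
    """
    diff = np.asarray(t, dtype=float) - np.asarray(s, dtype=float)
    lam = rng.normal(scale=np.sqrt(2.0), size=(n_draws, diff.size))
    return float(np.mean(np.cos(lam @ diff)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t, s = [0.3, -0.4], [0.0, 0.2]
    print(se_covariance(t, s), spectral_covariance_mc(t, s, 200_000, rng))
```

The two printed numbers should agree up to Monte Carlo error, which shrinks at the usual $1/\sqrt{\text{draws}}$ rate.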

By a simple change of variables it follows that the RKHS of $W^{a,b,q}$, for any $a > 0$, $b \in \{0,1\}^d$
and $q \in O_d$, is given by functions $h$ such that $h(x) = \mathrm{Re}\{\int e^{i\langle\lambda, x\rangle}\psi(\lambda)\,d\mu_{a,b,q}(\lambda)\}$ with RKHS
norm $\|\psi\|_{L_2(\mu_{a,b,q})}$, where $\mu_{a,b,q}$ is the $d$-dimensional Gaussian probability measure with mean
0 and variance matrix $2a^2 q'\mathrm{diag}(b)q$. In the rest of the paper this RKHS is denoted $\mathbb{H}^{a,b,q}$
and $\mathbb{H}^{a,b,q}_1$ is used to denote its unit ball at the origin. Also, $\mathbb{B}$ is used to denote the Banach
space of continuous functions on $U_d$ equipped with the supremum norm $\|\cdot\|_\infty$. The unit ball
at the origin of this space is denoted $\mathbb{B}_1$.

3.1 Variable selection extension 

To start with let $W^{a,b}$ denote $W^{a,b,q}$ with $q$ fixed at the $d$-dimensional identity matrix and let
$\mathbb{H}^{a,b}$ stand for the corresponding RKHS $\mathbb{H}^{a,b,I_d}$. For any $b \in \{0,1\}^d$ and $x \in U_d$ let $x_b$ denote
the $|b|$-dimensional vector of coordinates of $x$ selected by $b$. Also for any $\alpha > 0$ and any
$\tilde d \in \{0, \ldots, d\}$ let $H_{\alpha,\tilde d}$ denote the class of Hölder $\alpha$-smooth, real functions on $U_{\tilde d}$. Notice
that if $f_0(x)$ depends only on $d_1$ many coordinates of $x$ then there exist $b_0 \in \{0,1\}^d$ with
$|b_0| = d_1$ and $v_0 \in H_{\alpha,d_1}$ such that $f_0(x) = v_0(x_{b_0})$ for all $x \in U_d$.

Theorem 5. Let $f_0 : U_d \to \mathbb{R}$ satisfy $f_0(x) = v_0(x_{b_0})$ for some $b_0 \in \{0,1\}^d$ with $|b_0| = d_1$
and some $v_0 \in H_{\alpha,d_1}$. Then for every $s > 0$, there exist measurable subsets $B_n \subset \mathbb{R}^{U_d}$ and a
constant $K > 0$ such that (5)-(7) hold with $W = W^{A,B}$, $\bar\epsilon_n = n^{-\alpha/(2\alpha+d_1)}(\log n)^{(d+1)/2+s}$,
$\epsilon_n = K\bar\epsilon_n(\log n)^{(d+1)/2}$.
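
As a check on how Theorem 5 connects to the rate in Theorem 1, the two sequences in its statement can be multiplied out. This is only the arithmetic implied by the statement, with $k = d + 1 + s$ introduced here as shorthand.

```latex
\epsilon_n
  = K \bar\epsilon_n (\log n)^{(d+1)/2}
  = K n^{-\alpha/(2\alpha + d_1)} (\log n)^{(d+1)/2 + s + (d+1)/2}
  = K n^{-\alpha/(2\alpha + d_1)} (\log n)^{d + 1 + s}.
```

Since $s > 0$ is arbitrary, the exponent $k = d + 1 + s$ ranges over all $k > d + 1$, which matches the condition on $k$ in Theorem 1.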

Proof. Define $W^a_{b_0} = (W^a_{b_0}(u) : u \in U_{d_1})$ by $W^a_{b_0}(u) = W^{a,b_0}(u^{b_0})$ where $u^{b_0}$ denotes the
unique zero-insertion expansion of $u$ to a $d$-dimensional vector such that $(u^{b_0})_{b_0} = u$. For
any $u \in U_{d_1}$, for every $x \in U_d$ with $x_{b_0} = u$ we have $W^{a,b_0}(x) = W^a_{b_0}(u)$ and $f_0(x) = v_0(u)$.
So

$$P(\|W^{A,B} - f_0\|_\infty \le \bar\epsilon_n) \ge \pi_B(b_0)\,P(\|W^A_{b_0} - v_0\|_\infty \le \bar\epsilon_n).$$

From calculations presented in Section 5.1 of van der Vaart and van Zanten (2009) it follows that
$P(\|W^A_{b_0} - v_0\|_\infty \le \delta_n) \ge \exp(-n\delta_n^2)$ for $\delta_n$ a large multiple of $n^{-\alpha/(2\alpha+|b_0|)}(\log n)^{(1+|b_0|)/(2+|b_0|/\alpha)}$
and all sufficiently large $n$. This leads to (5) because $\bar\epsilon_n$ is larger than any such $\delta_n$ by a power
of $\log n$.

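It follows from the proof of Theorem 3.1 of van der Vaart and van Zanten (2009) that
for some $C_0 > 0$, $a_0 > 1$, $0 < \epsilon_0 < 1/2$ and for every $b \in \{0,1\}^d \setminus \{0\}$, $r > a_0$, $0 < \epsilon < \epsilon_0$,
$M^2 > C_0 r^{|b|}(\log(r/\epsilon))^{1+|b|}$ and $\delta = \epsilon/(2|b|^{3/2}M)$ the set

$$B^b_{M,r,\epsilon,\delta} = \{(r/\delta)^{|b|/2} M \mathbb{H}^{r,b}_1 + \epsilon\mathbb{B}_1\} \cup (\cup_{a<\delta} M\mathbb{H}^{a,b}_1 + \epsilon\mathbb{B}_1)$$

satisfies

$$P(W^{A,b} \notin B^b_{M,r,\epsilon,\delta}) \le C_1 r^{(a_1-1)|b|+1} e^{-C_2 r^{|b|}} + e^{-M^2/8}, \tag{8}$$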

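$$\log N(3\epsilon, B^b_{M,r,\epsilon,\delta}, \|\cdot\|_\infty) \le C_3 r^{|b|}\left(\log\frac{M^{3/2}\sqrt{2\tau r}\,|b|^{1/4}}{\epsilon^{3/2}}\right)^{1+|b|} + 2\log\frac{2M\sqrt{\|\mu\|}}{\epsilon}, \tag{9}$$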

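for universal constants $C_1$, $C_2$, $C_3$. For any $b \in \{0,1\}^d \setminus \{0\}$ define $B^b_n = B^b_{M_n,r_n,\bar\epsilon_n,\delta_n}$ where $r_n^{|b|}$
is a large multiple of $n\bar\epsilon_n^2$, $M_n^2$ is a large multiple of $n\bar\epsilon_n^2(\log n)^{1+|b|}$ and $\delta_n = \bar\epsilon_n/(2|b|^{3/2}M_n)$.
Then by the above inequalities, $P(W^{A,b} \notin B^b_n) \le \exp(-C_4 n\bar\epsilon_n^2)$ and $\log N(K\bar\epsilon_n, B^b_n, \|\cdot\|_\infty) \le
C_5 n\bar\epsilon_n^2(\log n)^{1+|b|}$ for some large constants $C_4$, $C_5$ and $K$. It is easy to construct a $B^0_n$ with
similar properties when $b = 0$. Therefore (6), (7) hold with $B_n = \cup_{b \in \{0,1\}^d} B^b_n$. □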

Corollary 1. Let $f_0 : U_d \times [0, 1] \to \mathbb{R}$ satisfy $f_0(x, u) = v_0(x_{b_0}, u)$ for some $b_0 \in \{0,1\}^d$
with $|b_0| = d_1$ and some Hölder $\alpha$-smooth $v_0 : U_{d_1} \times [0, 1] \to \mathbb{R}$. Then for every $s > 0$, there
exist measurable subsets $B_n \subset \mathbb{R}^{U_d \times [0,1]}$ and a constant $K > 0$ such that (5)-(7) hold with
$W = Z^{A,B,I_d}$, $\bar\epsilon_n = n^{-\alpha/(2\alpha+d_1+1)}(\log n)^{(d+2)/2+s}$, $\epsilon_n = K\bar\epsilon_n(\log n)^{(d+2)/2}$.

Proof. A proof can be constructed exactly along the lines of the proof above. The extra 
variable does not alter calculations, except for increasing all dimensions by one, because the 
variable selection parameter does not operate on it. □ 

3.2 Linear projection extension 

Our proof of Theorem 5 is made relatively straightforward by the fact that $B$ lives on a
discrete set. This is no longer the case when we work with $W^{A,B,Q}$ with $Q$ taking values on
a continuum. However, the support of $Q$, namely $O_d$, is a well behaved compact set, a fact
that we make good use of. To start with, here is a result that shows how to relate the RKHS
$\mathbb{H}^{a,b,q}$ with the RKHS $\mathbb{H}^{a,b,\tilde q}$ when $q, \tilde q \in O_d$ are close to each other.

Lemma 1. For any $a > 0$, $b \in \{0,1\}^d$ and $q, \tilde q \in O_d$, $\mathbb{H}^{a,b,q}_1 \subset \mathbb{H}^{a,b,\tilde q}_1 + a\sqrt{d}\,\|q - \tilde q\|_S\,\mathbb{B}_1$ where
$\|\cdot\|_S$ denotes the spectral norm on $O_d$.

Proof. Any $h \in \mathbb{H}^{a,b,q}_1$ can be expressed as $h(x) = \mathrm{Re}\{\int e^{i\langle\lambda, x\rangle}\psi(\lambda)\,d\mu_{a,b,q}(\lambda)\}$ for some
$\psi \in L_2(\mu_{a,b,q})$ with norm no larger than 1. For any $x, z \in U_d$,

$$|h(x) - h(z)| \le \int |\langle\lambda, x - z\rangle|\,|\psi(\lambda)|\,d\mu_{a,b,q}(\lambda) \le \|x - z\| \left\{\int \|\lambda\|^2\,d\mu_{a,b,q}(\lambda)\right\}^{1/2} \le a\sqrt{d}\,\|x - z\|$$

where the second inequality follows from two applications of the Cauchy-Schwarz inequality.

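Define $\tilde h(x) = h(q'\tilde q x)$ and $\tilde\psi(\lambda) = \psi(q'\tilde q\lambda)$. Then $\tilde\psi \in L_2(\mu_{a,b,\tilde q})$ with norm no larger
than 1 and $\tilde h(x)$ is the real part of

$$\int e^{i\langle\lambda, q'\tilde q x\rangle}\psi(\lambda)\,d\mu_{a,b,q}(\lambda) = \int e^{i\langle\lambda, x\rangle}\psi(q'\tilde q\lambda)\,d\mu_{a,b,\tilde q}(\lambda) = \int e^{i\langle\lambda, x\rangle}\tilde\psi(\lambda)\,d\mu_{a,b,\tilde q}(\lambda),$$

and therefore $\tilde h \in \mathbb{H}^{a,b,\tilde q}_1$. From this the result follows because for any $x \in U_d$,
$|h(x) - \tilde h(x)| \le a\sqrt{d}\,\|x - q'\tilde q x\| \le a\sqrt{d}\,\|q - \tilde q\|_S$. □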

Next we present the counterpart of Theorem 5 for the linear projection extension. Notice
that if $f_0(x)$ depends on $x$ only through a rank-$d_0$ linear projection $Rx$ then there exist
$b_0 \in \{0,1\}^d$ with $|b_0| = d_0$, $q_0 \in O_d$ and $v_0 \in H_{\alpha,d_0}$ such that $f_0(x) = v_0((q_0x)_{b_0})$.

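Theorem 6. Let $f_0 : U_d \to \mathbb{R}$ satisfy $f_0(x) = v_0((q_0x)_{b_0})$ for some $q_0 \in O_d$, $b_0 \in \{0,1\}^d$
with $|b_0| = d_0$ and some $v_0 \in H_{\alpha,d_0}$ with $\alpha > 1$. Then for every $s > 0$, there exist measurable
subsets $B_n \subset \mathbb{R}^{U_d}$ and a constant $K > 0$ such that (5)-(7) hold with $W = W^{A,B,Q}$,
$\bar\epsilon_n = n^{-\alpha/(2\alpha+d_0)}(\log n)^{(d+1)/2+s}$, $\epsilon_n = K\bar\epsilon_n(\log n)^{(d+1)/2}$.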

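Proof. Clearly,

$$P(\|W^{A,B,Q} - f_0\|_\infty \le \bar\epsilon_n) \ge \pi_B(b_0) \int P(\|W^{A,b_0,q} - f_0\|_\infty \le \bar\epsilon_n)\,\pi_Q(q)\,dq.$$

For any $q \in O_d$, define $W^a_{b_0,q} = (W^a_{b_0,q}(u) : u \in U_{d_0})$ and $f_q : U_d \to \mathbb{R}$ by
$W^a_{b_0,q}(u) = W^{a,b_0,q}(q'u^{b_0})$ and $f_q(x) = v_0((qx)_{b_0})$. For any $u \in U_{d_0}$ and every $x \in U_d$ with $(qx)_{b_0} = u$ we
have $W^{a,b_0,q}(x) = W^a_{b_0,q}(u)$ and $f_q(x) = v_0(u)$. Now, if $q \in O_d$ is such that $\|f_0 - f_q\|_\infty \le \bar\epsilon_n/2$
then

$$P(\|W^{A,b_0,q} - f_0\|_\infty \le \bar\epsilon_n) \ge P(\|W^{A,b_0,q} - f_q\|_\infty \le \bar\epsilon_n/2) = P(\|W^A_{b_0,q} - v_0\|_\infty \le \bar\epsilon_n/2).$$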

This last probability does not depend on $q$ because $W$ is rotationally invariant and hence
equals $P(\|W^A_{b_0,I_d} - v_0\|_\infty \le \bar\epsilon_n/2) \ge e^{-n\delta_n^2}$ with $\delta_n$ a multiple of $n^{-\alpha/(2\alpha+|b_0|)}(\log n)^{(1+|b_0|)/(2+|b_0|/\alpha)}$,
as in the previous theorem. From this (5) would follow if we can show
$P(Q \in \{q : \|f_0 - f_q\|_\infty \le \bar\epsilon_n/2\}) \ge e^{-n\delta_n^2}$. Note that $f_q(x) = f_0(q_0'qx)$ for all $x \in U_d$. By assumption
on $v_0$, $f_0$ has a bounded continuous derivative and hence $\|f_0 - f_q\|_\infty \le D_2\|q_0 - q\|_S$
for some $D_2 < \infty$. But $P(\|Q - q_0\|_S \le \bar\epsilon_n/(2D_2)) \ge D_3\bar\epsilon_n^{d(d-1)/2}$ for some constant $D_3$
because a spectral ball of radius $\delta$ in $O_d$ has volume of the order $\delta^{d(d-1)/2}$ for all small $\delta > 0$
and $\pi_Q$ is strictly positive on $O_d$. This completes the proof of the first assertion because
$\bar\epsilon_n^{d(d-1)/2} \ge e^{-n\delta_n^2}$ as $-\log\bar\epsilon_n/(n\delta_n^2) \to 0$.

To construct the sets $B_n$ we adapt the approach taken in the previous theorem to include
$q$. In particular, for each $b \in \{0,1\}^d \setminus \{0\}$ and each $q \in O_d$, define

$$B^{b,q}_{M,r,\epsilon,\delta} = \{(r/\delta)^{|b|/2} M \mathbb{H}^{r,b,q}_1 + \epsilon\mathbb{B}_1\} \cup (\cup_{a<\delta} M\mathbb{H}^{a,b,q}_1 + \epsilon\mathbb{B}_1).$$

Inequalities (8) and (9) continue to hold with $B^b_{M,r,\epsilon,\delta}$ and $W^{A,b}$ respectively replaced with
$B^{b,q}_{M,r,\epsilon,\delta}$ and $W^{A,b,q}$. Therefore by defining $B^{b,q}_n = B^{b,q}_{M_n,r_n,\bar\epsilon_n,\delta_n}$ with $M_n$, $r_n$ and $\delta_n$ exactly as
before we have $P(W^{A,b,q} \notin B^{b,q}_n) \le \exp(-C_4 n\bar\epsilon_n^2)$ and $\log N(K\bar\epsilon_n, B^{b,q}_n, \|\cdot\|_\infty) \le C_5 n\bar\epsilon_n^2(\log n)^{1+|b|}$
for some finite constants $C_4$, $C_5$ and $K$.

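Fix any $b \in \{0,1\}^d \setminus \{0\}$ and take $B^b_n = \cup_{q \in O_d} B^{b,q}_n$. Then $P(W^{A,b,Q} \notin B^b_n) \le \exp(-C_4 n\bar\epsilon_n^2)$.
To bound the entropy of $B^b_n$, first get a $\zeta_n$-spectral norm net $Q_n$ of $O_d$ where
$\zeta_n = \bar\epsilon_n/\{M_n r_n\sqrt{d}\,(r_n/\delta_n)^{|b|/2}\}$. The size of $Q_n$ is no larger than $C_6\zeta_n^{-d(d-1)/2}$ for some universal
constant $C_6$. For any $q \in O_d$ find $\tilde q \in Q_n$ such that $\|q - \tilde q\|_S \le \zeta_n$. Then by Lemma 1,
$B^{b,q}_n \subset B^{b,\tilde q}_n + \bar\epsilon_n\mathbb{B}_1$ and hence,

$$\log N((K + 1)\bar\epsilon_n, B^b_n, \|\cdot\|_\infty) \le \log\max_{q \in Q_n} N(K\bar\epsilon_n, B^{b,q}_n, \|\cdot\|_\infty) + \log|Q_n|$$

which is smaller than $C_7 n\bar\epsilon_n^2(\log n)^{1+|b|}$ for some constant $C_7$. Therefore (6), (7) hold with
$B_n = \cup_{b \in \{0,1\}^d} B^b_n$. □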

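Corollary 2. Let $f_0 : U_d \times [0, 1] \to \mathbb{R}$ satisfy $f_0(x, u) = v_0((q_0x)_{b_0}, u)$ for some $q_0 \in O_d$,
$b_0 \in \{0,1\}^d$ with $|b_0| = d_0$ and some Hölder $\alpha$-smooth $v_0 : U_{d_0} \times [0, 1] \to \mathbb{R}$. Then for every
$s > 0$, there exist measurable subsets $B_n \subset \mathbb{R}^{U_d \times [0,1]}$ and a constant $K > 0$ such that (5)-(7)
hold with $W = Z^{A,B,Q}$, $\bar\epsilon_n = n^{-\alpha/(2\alpha+d_0+1)}(\log n)^{(d+2)/2+s}$, $\epsilon_n = K\bar\epsilon_n(\log n)^{(d+2)/2}$.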

Proof. Again, the additional variable is unaffected by the projection parameter which oper- 
ates only on the x variable. So the above proof can be extended almost verbatim to prove 
this result. □ 



4 Discussion 

Besides variable selection and linear projection, another common way of extending Gaussian 
processes is to equip them with a vector of rescaling parameters, each operating along a 
single axis (Williams and Rasmussen, 1996). In our notation, this could be defined as
$W^a = (W^a(x) : x \in U_d)$ with $W^a(x) = W(\mathrm{diag}(a) \cdot x)$, for $a \in [0, \infty)^d$. The law of
$W^A$, with $A$ assigned some prior distribution $\pi_A$, can be used as a prior distribution for
a function valued parameter $f : U_d \to \mathbb{R}$. In an independent work, done in parallel to
ours and posted on arXiv.org, Bhattacharya et al. (2011) explore and establish some very
interesting adaptability properties of such extensions. From a practitioner's point of view,
the most interesting extension along this line would be $W^{A,Q}$ with $(A, Q) \sim \pi_A \times \pi_Q$ where
for any $a \in [0, \infty)^d$, $q \in O_d$, one defines $W^{a,q}(x) = W(\mathrm{diag}(a) \cdot qx)$. One should be able to
study theoretical properties of such an extension by combining the results presented in this
paper and in Bhattacharya et al. (2011), but the details remain to be verified.

Note that we restrict to functions $f_0$ that are only finitely differentiable. For $f_0$ that are
infinitely differentiable and satisfy some regularity conditions, a rescaled GP model offers
a nearly parametric posterior convergence rate of $n^{-1/2}(\log n)^k$ for some $k$ that depends on
$d$ and $f_0$ (van der Vaart and van Zanten, 2009). The dimension does not affect the leading
term $n^{-1/2}$ and our techniques in Section 3 do not offer improvement in the logarithmic
factor.

It might seem a little underwhelming that for density regression our choice of metric $\rho_{G_x}$
essentially defines the Hellinger metric between the joint densities $h_1(x, y) = g_1(y \mid x)$
and $h_2(x, y) = g_2(y \mid x)$, with respect to the product of $G_x$ and the Lebesgue measure
on $\mathbb{R}$, thus transporting the problem to one where one studies the joint density of $(X, Y)$. We
make two observations to point out why this is not a terrible thing to do. First, no modeling
is done on the unknown distribution $G_x$; the joint density view is purely a technical tool
needed to map conditions required to prove Theorem 4 to conditions (5)-(7). Second, the
goals of regression are well preserved despite the use of $G_x$ in defining $\rho_{G_x}$. Suppose one is
interested in inference on $g(y \mid x^*)$ for test data $x^*$ generated from $G^*$, possibly different from
$G_x$. For this task, a more useful metric is given by $\rho_{G^*}(g_1, g_2)$, defined in the same way as
$\rho_{G_x}$ but with $G^*$ in place of $G_x$. But an $\epsilon_n$ rate of convergence in $\rho_{G_x}$ also implies an $\epsilon_n$ rate
of convergence in $\rho_{G^*}$ as long as $G^*$ is absolutely continuous with respect to $G_x$. Absolute
continuity is unavoidable because one can hope to make accurate predictions only at points
where data accumulate.

A related issue is the debate whether density regression can be essentially carried out by
a nonparametric estimation of the joint density of $(X, Y)$, such as in Müller et al. (1996).
Our results indicate that if inference on the conditional densities of $Y$ given $X$ is of interest,
then there might indeed be an advantage in pursuing the density regression formulation with
the potential of obtaining much faster convergence rates through a suitable projection of $X$.
From a practitioner's point of view, this would mean sharper inference, with shorter credible
bands for the same amount of data than what one would obtain in the joint density estimation
formulation.